Sums of Squares and Hypothesis Testing

Dr. Lucy D’Agostino McGowan

Understanding Sums of Squares

What Are We Measuring?

The Big Picture: How well does our model explain the data?

Three key quantities:
- Total variation in the data
- Variation explained by our model
- Variation left unexplained (residuals)

The Linear Model Reminder

Our model: \(\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}\)

Fitted values: \(\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{H}\mathbf{y}\)

where \(\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\) is the “hat matrix”

Residuals: \(\hat{\boldsymbol{\varepsilon}} = \mathbf{y} - \hat{\mathbf{y}}\)
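These quantities are easy to compute directly. A minimal sketch in R, using the built-in mtcars data as a stand-in example (not a dataset from this lecture):

```r
# Build the design matrix and response for a simple model
X <- model.matrix(mpg ~ wt, data = mtcars)  # includes the intercept column
y <- mtcars$mpg

# Hat matrix: H = X (X'X)^{-1} X'
H <- X %*% solve(t(X) %*% X) %*% t(X)

y_hat <- as.vector(H %*% y)  # fitted values
e_hat <- y - y_hat           # residuals

# Matches lm()'s fitted values (up to floating-point error)
all.equal(y_hat, unname(fitted(lm(mpg ~ wt, data = mtcars))))
```

Note that `lm()` itself uses a QR decomposition rather than forming \((\mathbf{X}^T\mathbf{X})^{-1}\) explicitly; the direct formula is shown only to mirror the math.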

Total Sum of Squares (TSS)

What is TSS?

Definition: Total variation in the response variable \(\mathbf{y}\)

Formula: \[\text{TSS} = \sum_{i=1}^n (y_i - \bar{y})^2\]

What is TSS?

Definition: Total variation in the response variable \(\mathbf{y}\)

Matrix form: \[\text{TSS} = (\mathbf{y} - \bar{y}\mathbf{1})^T(\mathbf{y} - \bar{y}\mathbf{1})\]

where \(\mathbf{1}\) is a vector of ones and \(\bar{y}\) is the sample mean

TSS Intuition

Think of TSS as: “How spread out are my y-values?”

Key insight: This is what we’re trying to explain with our model

Alternative matrix form: \[\text{TSS} = \mathbf{y}^T\mathbf{y} - n\bar{y}^2\]

You Try: Calculate TSS

Given: \(\mathbf{y} = \begin{bmatrix} 2 \\ 4 \\ 6 \\ 8 \end{bmatrix}\)

Find: TSS using both the definition and matrix form


You Try: Setup

Step 1: Calculate \(\bar{y} = \frac{2+4+6+8}{4} = 5\)

Step 2: Deviations from mean: \(\begin{bmatrix} -3 \\ -1 \\ 1 \\ 3 \end{bmatrix}\)

You Try: Solution

Method 1 (definition): \[\text{TSS} = (-3)^2 + (-1)^2 + 1^2 + 3^2 = 9 + 1 + 1 + 9 = 20\]

Method 2 (matrix form): \[\text{TSS} = 2^2 + 4^2 + 6^2 + 8^2 - 4(5^2) = 120 - 100 = 20\]
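Both methods are quick to check in R (a small sketch using the vector from the exercise):

```r
y <- c(2, 4, 6, 8)
n <- length(y)

# Method 1: definition, sum of squared deviations from the mean
tss_def <- sum((y - mean(y))^2)

# Method 2: matrix form, y'y - n * ybar^2
tss_mat <- sum(y^2) - n * mean(y)^2

c(tss_def, tss_mat)  # both equal 20
```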

Sum of Squared Errors (SSE)

What is SSE?

Definition: Variation left unexplained by our model

Formula: \[\text{SSE} = \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \sum_{i=1}^n \hat{\varepsilon}_i^2\]

What is SSE?

Definition: Variation left unexplained by our model

Matrix form: \[\text{SSE} = \hat{\boldsymbol{\varepsilon}}^T\hat{\boldsymbol{\varepsilon}} = (\mathbf{y} - \hat{\mathbf{y}})^T(\mathbf{y} - \hat{\mathbf{y}})\]

SSE with Hat Matrix

Since \(\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}\): \[\text{SSE} = (\mathbf{y} - \mathbf{H}\mathbf{y})^T(\mathbf{y} - \mathbf{H}\mathbf{y})\]

Factor out \(\mathbf{y}\): \[= \mathbf{y}^T(\mathbf{I} - \mathbf{H})^T(\mathbf{I} - \mathbf{H})\mathbf{y}\]

SSE with Hat Matrix

Key property: \(\mathbf{I} - \mathbf{H}\) is symmetric and idempotent
\[\text{SSE} = \mathbf{y}^T(\mathbf{I} - \mathbf{H})\mathbf{y}\]
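A quick numerical check of this quadratic form in R (a sketch, again using mtcars as a stand-in dataset):

```r
X <- model.matrix(mpg ~ wt, data = mtcars)
y <- mtcars$mpg

H <- X %*% solve(t(X) %*% X) %*% t(X)
I_n <- diag(nrow(X))

# SSE via the quadratic form y'(I - H)y
sse_quadratic <- as.numeric(t(y) %*% (I_n - H) %*% y)

# SSE via lm()'s residuals
sse_lm <- sum(resid(lm(mpg ~ wt, data = mtcars))^2)

all.equal(sse_quadratic, sse_lm)
```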

SSE Intuition

Think of SSE as: “How much did we miss with our model?”

Smaller SSE = Better fit
Larger SSE = Worse fit

Perfect fit: SSE = 0 (model explains everything)

Model Sum of Squares (\(\textrm{SS}_\textrm{Reg}\))

What is Regression Sum of Squares?

Definition: Variation explained by our model

Formula: \[\text{SS}_{\text{Reg}} = \sum_{i=1}^n (\hat{y}_i - \bar{y})^2\]

What is Regression Sum of Squares?

Definition: Variation explained by our model

Matrix form: \[\text{SS}_{\text{Reg}} = (\hat{\mathbf{y}} - \bar{y}\mathbf{1})^T(\hat{\mathbf{y}} - \bar{y}\mathbf{1})\]

Regression SS with Hat Matrix

Since \(\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}\): \[\text{SS}_{\text{Reg}} = (\mathbf{H}\mathbf{y} - \bar{y}\mathbf{1})^T(\mathbf{H}\mathbf{y} - \bar{y}\mathbf{1})\]

Alternative form: \[\text{SS}_{\text{Reg}} = \mathbf{y}^T\mathbf{H}\mathbf{y} - n\bar{y}^2\]
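The two forms agree numerically, as a short R sketch shows (mtcars as a stand-in dataset):

```r
X <- model.matrix(mpg ~ wt, data = mtcars)
y <- mtcars$mpg
n <- nrow(X)
H <- X %*% solve(t(X) %*% X) %*% t(X)

# SS_Reg via the quadratic form y'Hy - n * ybar^2
ss_reg_quadratic <- as.numeric(t(y) %*% H %*% y) - n * mean(y)^2

# SS_Reg via the definition sum((y_hat - ybar)^2)
y_hat <- as.vector(H %*% y)
ss_reg_def <- sum((y_hat - mean(y))^2)

all.equal(ss_reg_quadratic, ss_reg_def)
```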

Regression Sum of Squares Intuition

Think of \(\textrm{SS}_\textrm{Reg}\) as: “How much variation did our model capture?”

Larger \(\textrm{SS}_\textrm{Reg}\) = Model explains more
Smaller \(\textrm{SS}_\textrm{Reg}\) = Model explains less

Intercept-only model: \(\textrm{SS}_\textrm{Reg}\) = 0 (just predicting the mean)

The Fundamental Identity

It all adds up!

The key relationship: \[\text{Total Variation} = \text{Unexplained} + \text{Explained}\]

You Try: Prove the Identity

Challenge: Show that TSS = SSE + \(\textrm{SS}_\textrm{Reg}\) using matrix algebra

Hint: Start with the definitions and use properties of the hat matrix


Proof Strategy

Start with: TSS = \(\mathbf{y}^T\mathbf{y} - n\bar{y}^2\)

Key insight: We need to decompose \(\mathbf{y}\) as \(\mathbf{y} = \hat{\mathbf{y}} + \hat{\boldsymbol{\varepsilon}}\)

Then show: Cross terms vanish due to orthogonality

Proof: Step by Step

Write: \(\mathbf{y} = \hat{\mathbf{y}} + \hat{\boldsymbol{\varepsilon}} = \mathbf{H}\mathbf{y} + (\mathbf{I}-\mathbf{H})\mathbf{y}\)

Then: \[\mathbf{y}^T\mathbf{y} = (\mathbf{H}\mathbf{y} + (\mathbf{I}-\mathbf{H})\mathbf{y})^T(\mathbf{H}\mathbf{y} + (\mathbf{I}-\mathbf{H})\mathbf{y})\]

Expand the Product

Four terms: \[= \mathbf{y}^T\mathbf{H}^T\mathbf{H}\mathbf{y} + \mathbf{y}^T\mathbf{H}^T(\mathbf{I}-\mathbf{H})\mathbf{y}\] \[+ \mathbf{y}^T(\mathbf{I}-\mathbf{H})^T\mathbf{H}\mathbf{y} + \mathbf{y}^T(\mathbf{I}-\mathbf{H})^T(\mathbf{I}-\mathbf{H})\mathbf{y}\]

Cross Terms Vanish

Key property: \(\mathbf{H}(\mathbf{I}-\mathbf{H}) = \mathbf{0}\) (orthogonal projections)

Therefore: The cross terms equal zero

Result: \[\mathbf{y}^T\mathbf{y} = \mathbf{y}^T\mathbf{H}\mathbf{y} + \mathbf{y}^T(\mathbf{I}-\mathbf{H})\mathbf{y}\]

Final Identity

Subtract \(n\bar{y}^2\) from both sides: \[\text{TSS} = \text{SS}_{\text{Reg}} + \text{SSE}\]

Beautiful result: Total variation splits perfectly into explained and unexplained parts
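The identity is easy to verify numerically. A sketch in R, with mtcars as a stand-in dataset:

```r
fit <- lm(mpg ~ wt, data = mtcars)
y <- mtcars$mpg

tss    <- sum((y - mean(y))^2)
sse    <- sum(resid(fit)^2)
ss_reg <- sum((fitted(fit) - mean(y))^2)

# TSS = SS_Reg + SSE (up to floating-point error)
all.equal(tss, ss_reg + sse)
```

(The decomposition around \(\bar{y}\) relies on the model containing an intercept; without one, the cross terms need not vanish.)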

Coefficient of Determination (R²)

What is R²?

Definition: Proportion of total variation explained by the model

Formula: \[R^2 = \frac{\text{SS}_{\text{Reg}}}{\text{TSS}} = 1 - \frac{\text{SSE}}{\text{TSS}}\]

Range: \(0 \leq R^2 \leq 1\)

R² Interpretation

R² = 0.8 means “80% of variation is explained by the model”

Perfect fit: R² = 1 (model explains everything)
No relationship: R² = 0 (model explains nothing)

Warning: High R² doesn’t always mean good model!

You Try: Calculate R²

Given: TSS = 100, SSE = 25

Find: R² and interpret the result


You Try: Solution

Method 1: \[R^2 = 1 - \frac{\text{SSE}}{\text{TSS}} = 1 - \frac{25}{100} = 0.75\]

Method 2: \[\text{SS}_{\text{Reg}} = \text{TSS} - \text{SSE} = 100 - 25 = 75\] \[R^2 = \frac{75}{100} = 0.75\]

Interpretation: The model explains 75% of the variation in y
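The same computation in R, checked against `summary()` (a sketch with mtcars as a stand-in dataset):

```r
fit <- lm(mpg ~ wt, data = mtcars)
y <- mtcars$mpg

sse <- sum(resid(fit)^2)
tss <- sum((y - mean(y))^2)

# R^2 = 1 - SSE/TSS
r2_manual <- 1 - sse / tss

# Matches the R^2 reported by summary()
all.equal(r2_manual, summary(fit)$r.squared)
```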

The Danger of R²: Anscombe’s Quartet

When R² Misleads

# Load the famous Anscombe's Quartet
data(anscombe)

# All have same R²!
lm1 <- lm(y1 ~ x1, data = anscombe)
lm2 <- lm(y2 ~ x2, data = anscombe) 
lm3 <- lm(y3 ~ x3, data = anscombe)
lm4 <- lm(y4 ~ x4, data = anscombe)

c(summary(lm1)$r.squared, summary(lm2)$r.squared, 
  summary(lm3)$r.squared, summary(lm4)$r.squared)
[1] 0.6665425 0.6662420 0.6663240 0.6667073

Visual Reality Check

Hypothesis Testing in Matrix Form

The Big Questions

Single predictor: Is \(\beta_1\) significantly different from 0?

Multiple predictors: Are any of the predictors useful?

Subset test: Is a group of predictors jointly significant?

Testing Single Coefficients

Null hypothesis: \(H_0: \beta_j = 0\)
Alternative: \(H_1: \beta_j \neq 0\)

Test statistic: \[t = \frac{\hat{\beta}_j}{\text{se}(\hat{\beta}_j)}\]

Standard error: \(\text{se}(\hat{\beta}_j) = \sqrt{\hat{\sigma}^2[(\mathbf{X}^T\mathbf{X})^{-1}]_{jj}}\)

Where Does the Standard Error Come From?

Recall: \(\text{Var}(\hat{\boldsymbol{\beta}}) = \sigma^2(\mathbf{X}^T\mathbf{X})^{-1}\)

For individual coefficient j: \[\text{Var}(\hat{\beta}_j) = \sigma^2[(\mathbf{X}^T\mathbf{X})^{-1}]_{jj}\]

Estimate \(\sigma^2\): \(\hat{\sigma}^2 = \frac{\text{SSE}}{n-p}\) where p = number of parameters
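These formulas reproduce the standard errors that `summary()` reports. A sketch in R (mtcars as a stand-in dataset):

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)
X <- model.matrix(fit)
n <- nrow(X)
p <- ncol(X)  # number of parameters, including the intercept

# Estimate sigma^2 as SSE / (n - p)
sigma2_hat <- sum(resid(fit)^2) / (n - p)

# se(beta_j) = sqrt(sigma2_hat * [(X'X)^{-1}]_{jj})
se_manual <- sqrt(sigma2_hat * diag(solve(t(X) %*% X)))

# Matches the Std. Error column of summary()
all.equal(unname(se_manual),
          unname(summary(fit)$coefficients[, "Std. Error"]))
```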

t-Distribution Result

Under normality assumption: \[\frac{\hat{\beta}_j - \beta_j}{\text{se}(\hat{\beta}_j)} \sim t_{n-p}\]

t-Distribution Result

For testing \(H_0: \beta_j = 0\): \[t = \frac{\hat{\beta}_j}{\text{se}(\hat{\beta}_j)} \sim t_{n-p}\]

Overall F-Test

Testing Multiple Predictors

Null hypothesis: \(H_0: \beta_1 = \beta_2 = \cdots = \beta_{p-1} = 0\)

Alternative: At least one \(\beta_j \neq 0\) (j ≠ 0)

This tests: “Is the model useful at all?”

F-Test Statistic

Test statistic: \[F = \frac{\text{SS}_{\text{Reg}}/(p-1)}{\text{SSE}/(n-p)} = \frac{\text{Mean Square Regression}}{\text{Mean Square Error}}\]

Under \(H_0\): \(F \sim F_{p-1, n-p}\)

F-Test Intuition

Numerator: How much variation does model explain per parameter?

Denominator: How much unexplained variation per residual degree of freedom?

Large F: Model explains a lot relative to noise
Small F: Model doesn’t explain much more than noise

Connection to R²

Alternative F-statistic form: \[F = \frac{R^2/(p-1)}{(1-R^2)/(n-p)}\]

This shows: F-test is really testing whether R² is significantly different from 0
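The identity can be checked against the F-statistic that `summary()` reports (a sketch in R, mtcars as a stand-in dataset):

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)
s <- summary(fit)

r2 <- s$r.squared
n <- nrow(mtcars)
p <- length(coef(fit))

# F = [R^2/(p-1)] / [(1 - R^2)/(n - p)]
f_from_r2 <- (r2 / (p - 1)) / ((1 - r2) / (n - p))

# Matches the overall F reported by summary()
all.equal(f_from_r2, unname(s$fstatistic["value"]))
```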

General Linear Hypothesis Testing

The Framework

General form: \(H_0: \mathbf{C}\boldsymbol{\beta} = \mathbf{d}\)

Where:
- \(\mathbf{C}\) is a contrast matrix (q × p)
- \(\mathbf{d}\) is a vector of constants
- q is the number of restrictions

Examples of Linear Hypotheses

Single coefficient: \(\mathbf{C} = [0, 1, 0, 0]\), \(\mathbf{d} = 0\)
Tests: \(\beta_1 = 0\)

Overall test: \(\mathbf{C} = \begin{bmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}\), \(\mathbf{d} = \mathbf{0}\)
Tests: All slope coefficients = 0

Equality test: \(\mathbf{C} = [0, 1, -1, 0]\), \(\mathbf{d} = 0\)
Tests: \(\beta_1 = \beta_2\)

The General F-Test

Test statistic: \[F = \frac{(\mathbf{C}\hat{\boldsymbol{\beta}} - \mathbf{d})^T[\mathbf{C}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{C}^T]^{-1}(\mathbf{C}\hat{\boldsymbol{\beta}} - \mathbf{d})/q}{\text{SSE}/(n-p)}\]

Under \(H_0\): \(F \sim F_{q, n-p}\)

You Try: Set Up Hypothesis Test

Scenario: Three predictors, want to test if the last two coefficients are both zero

Model: \(y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon\)

Question: What are \(\mathbf{C}\) and \(\mathbf{d}\) for \(H_0: \beta_2 = \beta_3 = 0\)?


You Try: Solution

Answer: \[\mathbf{C} = \begin{bmatrix} 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}, \quad \mathbf{d} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}\]

Check: \(\mathbf{C}\boldsymbol{\beta} = \begin{bmatrix} \beta_2 \\ \beta_3 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}\)
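This test can be carried out directly from the general F formula. A sketch in R, fitting a three-predictor model to mtcars (a stand-in for the scenario above) and checking against the equivalent reduced-vs-full `anova()` comparison:

```r
# Stand-in model: beta0, beta1 (wt), beta2 (hp), beta3 (disp)
fit_full <- lm(mpg ~ wt + hp + disp, data = mtcars)
X <- model.matrix(fit_full)
n <- nrow(X)
p <- ncol(X)

# H0: beta2 = beta3 = 0
C <- rbind(c(0, 0, 1, 0),
           c(0, 0, 0, 1))
d <- c(0, 0)
q <- nrow(C)

beta_hat <- coef(fit_full)
sse <- sum(resid(fit_full)^2)

Cb <- C %*% beta_hat - d
f_stat <- as.numeric(
  t(Cb) %*% solve(C %*% solve(t(X) %*% X) %*% t(C)) %*% Cb / q
) / (sse / (n - p))
p_val <- pf(f_stat, q, n - p, lower.tail = FALSE)

# Same F as comparing the reduced model (wt only) to the full model
f_anova <- anova(lm(mpg ~ wt, data = mtcars), fit_full)$F[2]
all.equal(f_stat, f_anova)
```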

Putting It All Together

The ANOVA Table

| Source     | df      | Sum of Squares               | Mean Square                  | F                                 |
|------------|---------|------------------------------|------------------------------|-----------------------------------|
| Regression | \(p-1\) | \(\textrm{SS}_\textrm{Reg}\) | \(\textrm{MS}_\textrm{Reg}\) | \(\textrm{MS}_\textrm{Reg}\)/MSE  |
| Error      | \(n-p\) | SSE                          | MSE                          |                                   |
| Total      | \(n-1\) | TSS                          |                              |                                   |

Matrix Summary

Key relationships:
- TSS = \(\mathbf{y}^T\mathbf{y} - n\bar{y}^2\)
- \(\textrm{SS}_\textrm{Reg}\) = \(\mathbf{y}^T\mathbf{H}\mathbf{y} - n\bar{y}^2\)
- SSE = \(\mathbf{y}^T(\mathbf{I} - \mathbf{H})\mathbf{y}\)
- TSS = \(\textrm{SS}_\textrm{Reg}\) + SSE

R² = \(\textrm{SS}_\textrm{Reg}\)/TSS = 1 - SSE/TSS

Testing Hierarchy

Step 1: Overall F-test (is model useful?)

Step 2: Subset F-tests (are groups of predictors significant?)

Step 3: Individual t-tests (which predictors matter?)

Understanding P-values in Regression

What is a P-value?

Definition: The probability of observing a test statistic as extreme or more extreme than what we observed, assuming the null hypothesis is true

In other words: “How surprising is our result if \(H_0\) were true?”

P-values for t-tests

For testing \(H_0: \beta_j = 0\):

Test statistic: \(t = \frac{\hat{\beta}_j}{\text{se}(\hat{\beta}_j)}\)

Two-sided p-value:

\[\text{p-value} = P(|T| \geq |t|) = 2 \times P(T \geq |t|)\] where \(T \sim t_{n-p}\)

P-values for F-tests

For overall test \(H_0:\) all slopes = 0:

Test statistic: \(F = \frac{\text{SS}_{\text{Reg}}/(p-1)}{\text{SSE}/(n-p)}\)

One-sided p-value:

\[\text{p-value} = P(F_{p-1,n-p} \geq f)\] where \(f\) is our observed F-statistic

You Try: Interpret P-values

Scenario: Testing \(H_0: \beta_1 = 0\) with t = 2.8 and df = 18

Given: p-value = 0.012

Questions:

  • What does this p-value mean?
  • Would you reject \(H_0\) at α = 0.05?
  • Would you reject \(H_0\) at α = 0.01?

You Try: Solution

Interpretation: If \(\beta_1 = 0\) were true, there’s only a 1.2% chance of seeing a t-statistic at least as extreme as ±2.8

At α = 0.05: Reject \(H_0\) (p = 0.012 < 0.05) - Evidence suggests \(\beta_1 \neq 0\)

At α = 0.01: Fail to reject \(H_0\) (p = 0.012 > 0.01) - Not enough evidence at this stricter level

Computing P-values in R

The Key Functions

For t-tests: Use pt() function

For F-tests: Use pf() function

Both give: Cumulative distribution function (CDF) values

t-test P-values with pt()

Two-sided test: \(H_0: \beta_j = 0\)

# Example: t = 2.8, df = 18
t_stat <- 2.8
df <- 18

# Two-sided p-value
p_value_t <- 2 * (1 - pt(abs(t_stat), df))
p_value_t
[1] 0.01183672

t-test P-values with pt()

Why the formula?

  • pt(2.8, 18) gives P(T ≤ 2.8)
  • 1 - pt(2.8, 18) gives P(T > 2.8)
  • Multiply by 2 for both tails

F-test P-values with pf()

One-sided test: \(H_0:\) all slopes = 0

# Example: F = 12.44, df1 = 3, df2 = 16
f_stat <- 12.44
df1 <- 3  # numerator df (p-1)
df2 <- 16 # denominator df (n-p)

# One-sided p-value
p_value_f <- 1 - pf(f_stat, df1, df2)
p_value_f
[1] 0.0001879013
# Equivalent one-sided p-value (numerically more stable for very small tail probabilities)
p_value_f <- pf(f_stat, df1, df2, lower.tail = FALSE)
p_value_f
[1] 0.0001879013

F-test P-values with pf()

Why one-sided? F-statistics are always ≥ 0, so we only care about the right tail

You Try: Calculate P-values

Given:

  • t-statistic = -1.96, df = 24
  • F-statistic = 8.5, df1 = 2, df2 = 20

Calculate both p-values using R


You Try: Solution

# t-test p-value (two-sided)
t_val <- -1.96
df_t <- 24
p_t <- 2 * pt(abs(t_val), df_t, lower.tail = FALSE)
cat("t-test p-value:", round(p_t, 4))
t-test p-value: 0.0617
# F-test p-value (one-sided)
f_val <- 8.5
df1_f <- 2
df2_f <- 20
p_f <- pf(f_val, df1_f, df2_f, lower.tail = FALSE)
cat("\nF-test p-value:", round(p_f, 4))

F-test p-value: 0.0021

Interpretation:

  • t-test: p = 0.0617 (not significant at α = 0.05)
  • F-test: p = 0.0021 (significant at α = 0.05)

P-value Cautions

What p-values DON’T tell us:

  • How large or important the effect is
  • Whether the relationship is causal
  • Whether the model assumptions are met

Remember: Statistical significance ≠ practical significance

Best practice: Report both p-values AND effect sizes (\(\hat{\beta}_j\))

Key Takeaways

P-values help us:

  • Quantify evidence against \(H_0\)
  • Make consistent decisions about significance
  • Communicate strength of statistical evidence

But remember:

  • Context matters more than arbitrary cutoffs
  • Consider effect size alongside significance
  • Check model assumptions first

Final You Try: Complete Analysis

Given: n = 20, p = 4, TSS = 1000, SSE = 300

Calculate:
- R²
- Overall F-statistic
- Your conclusion at the α = 0.05 level


Final Solution

R² calculation: \[R^2 = 1 - \frac{SSE}{TSS} = 1 - \frac{300}{1000} = 0.7\]

F-statistic: \[F = \frac{SS_{Reg}/(p-1)}{SSE/(n-p)} = \frac{700/3}{300/16} = \frac{233.33}{18.75} = 12.44\]

Critical value: \(F_{0.05, 3, 16} = 3.24\)

Conclusion: Reject \(H_0\). The model is statistically significant.
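The same numbers can be verified with `pf()` and `qf()` (a short sketch):

```r
n <- 20; p <- 4
tss <- 1000; sse <- 300
ss_reg <- tss - sse  # 700

f_stat <- (ss_reg / (p - 1)) / (sse / (n - p))  # about 12.44
f_crit <- qf(0.95, df1 = p - 1, df2 = n - p)    # about 3.24
p_val  <- pf(f_stat, p - 1, n - p, lower.tail = FALSE)

c(F = f_stat, critical_value = f_crit, p_value = p_val)
```

Since the observed F exceeds the critical value (equivalently, p < 0.05), we reject \(H_0\).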